Goto

Collaborating Authors

 with-replacement sampling and convexity


New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity

Neural Information Processing Systems

As an incremental-gradient algorithm, the hybrid stochastic gradient descent (HSGD) enjoys merits of both stochastic and full gradient methods for finite-sum minimization problem. However, the existing rate-of-convergence analysis for HSGD is made under with-replacement sampling (WRS) and is restricted to convex problems. It is not clear whether HSGD still carries these advantages under the common practice of without-replacement sampling (WoRS) for non-convex problems. In this paper, we affirmatively answer this open question by showing that under WoRS and for both convex and non-convex problems, it is still possible for HSGD (with constant step-size) to match full gradient descent in rate of convergence, while maintaining comparable sample-size-independent incremental first-order oracle complexity to stochastic gradient descent. For a special class of finite-sum problems with linear prediction models, our convergence results can be further improved in some cases. Extensive numerical results confirm our theoretical affirmation and demonstrate the favorable efficiency of WoRS-based HSGD.


Reviews: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity

Neural Information Processing Systems

This paper lies in the well-developed field of variance-reduced stochastic gradient algorithms. It proposes a theoretical study of the well-known HSGD for sampling without replacement scheme, which is known to perform better than sampling with replacement in practice. The considered setting makes the analysis of the algorithm more difficult, since in this case the mini-batch gradients are not unbiased anymore. Rates for strongly-convex, non-strongly convex and non-convex objectives are provided, with a special care for linearly structured problems (such as generalized linear models). Numerical experiments are convincing enough, although the datasets used in experiments are not very large (largest one is covtype) (while authors argue that this methods is interesting in the large n medium precision setting).